Artem/fix gpu count by vertex451 · Pull Request #379 · akash-network/provider

vertex451 · 2026-03-23T11:45:25Z

Resolves:

Details

I haven't found a bug in how do we handle inventory events.

Also, I've found recent report about GPU driver reporting wrong values.

[Bug]: Allocatable nvidia.com/gpu=0 after v0.18.2 upgrade due to config-manager race on startup NVIDIA/k8s-device-plugin#1645

That is why I decided to add a guardrails that prevents uint underflow, also I added logging, so we can track this in the future.

Changes

Safe int64→uint64 conversion - clamp negatives to 0 instead of raw cast; guard against -1 sentinel in Available() returning MaxInt64
Sanitize all resource quantities at ingest - reject negative values from knode.Status.Allocatable/Capacity for CPU, Memory, GPU, Storage
GPU allocatable upper-bound - clamp to Capacity if Allocatable > Capacity
Pod key uses namespace/name - prevents cross-namespace accounting collision
Handle watch.Added on node watcher reconnect - was silently dropped, leaving stale allocatable
podsWatch.Stop() on restart - old code drained one event and leaked the watcher
Precision-preserving Sub() - old Value() - Value() truncated milli-precision, drifting CPU allocated toward 0

Testing

Tested locally at non-gpu machine and checked if CPU is consistently changed compared to the main branch.
Checked CPU and memory changes at each stages:
Create a provider -> Create a deployment -> Create a lease -> Send a manifest -> Delete a deployment.

coderabbitai · 2026-03-23T11:45:32Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Context is threaded into inventory objects; resource quantities are sanitized and clamped to zero when negative or underflowing; pod tracking keys become namespace-aware; watchers restart cleanly and node events filter by name; tests verify clamping, underflow behavior, and allocatable change detection across both inventory metrics and node discovery.

Changes

Resource Quantity Sanitization and Context Threading

Layer / File(s)	Summary
Inventory context threading and clamping setup `cluster/kube/operators/clients/inventory/client.go`, `cluster/kube/operators/clients/inventory/inventory.go`	Added `ctx context.Context` field to inventory struct; subscriber flow constructs outgoing inventory messages with request context; `newInventory` accepts and stores context; introduced `clampUint64` and `clampAvailableUint64` helpers that log and clamp negative quantities to zero; `dup()` simplified to preserve context.
Inventory metrics computation with clamping `cluster/kube/operators/clients/inventory/inventory.go`, `cluster/kube/operators/clients/inventory/client_test.go`	Metrics() derives logger from stored context and applies clamping helpers to CPU, GPU, Memory, and Ephemeral Storage totals, plus per-storage-class allocatable and available values, replacing unchecked conversions; test context wraps logger with `logr.Discard()`; new table-driven `TestInventoryMetrics_negative_quantities_clamped` streams clusters with mutated negative quantities and asserts metrics clamp all affected fields to zero, including per-storage-class and GPU available behavior.
Node resource sanitization `operator/inventory/node-discovery.go`, `operator/inventory/node-discovery_test.go`	Added logr import; introduced `sanitizeResourceQuantity` helper that clamps negative node-reported values to zero with error logs; `TestSanitizeResourceQuantity` verifies sanitization returns zero for negative and zero quantities across resource types.
Node info population and allocatable change detection `operator/inventory/node-discovery.go`, `operator/inventory/node-discovery_test.go`	`updateNodeInfo` accepts context, zero-resets allocatable and capacity fields before repopulating from node data, sanitizes negative quantities with logs, and enforces GPU allocatable ≤ GPU capacity; `initNodeInfo` calls refactored `updateNodeInfo` with context; allocatable change detection now uses `resource.Quantity.Cmp()` instead of raw value comparison; `TestNodeAllocatableChanged_detectsMilliCPUChange` verifies detection works across milli-CPU values.
Pod tracking with namespace-aware keying `operator/inventory/node-discovery.go`	New `podKey()` helper constructs namespace/name composite keys; pod map initialization and all add/modify/delete event handlers updated to use namespace-aware keys, preventing cross-namespace collisions; `restartPodsWatcher` stops and nils existing watcher instead of draining; node watch handling now filters events by node name, handles both Added and Modified events, conditionally calls `updateNodeInfo` with context, and deep-copies node after processing.
Pod resource underflow accounting with logging `operator/inventory/node-discovery.go`, `operator/inventory/node-discovery_test.go`	`subAllocatedNLZ` now accepts logger and node/resource identity, performs subtraction, detects underflow via comparison to zero, clamps to zero, and logs before/subtracted quantities; pod resource subtraction threads logging through all resource paths including emptyDir memory with dedicated resource identifier; `TestSubAllocatedNLZ_underflow_clamped` verifies underflow on subtraction is clamped to zero.

Sequence Diagram

sequenceDiagram
  participant NodeWatcher
  participant updateNodeInfo
  participant sanitizeResourceQuantity
  participant Metrics
  participant clampUint64
  NodeWatcher->>updateNodeInfo: ctx, node data
  updateNodeInfo->>updateNodeInfo: zero-reset fields
  updateNodeInfo->>sanitizeResourceQuantity: node allocatable
  alt negative value
    sanitizeResourceQuantity->>sanitizeResourceQuantity: log error
    sanitizeResourceQuantity->>updateNodeInfo: return 0
  else valid
    sanitizeResourceQuantity->>updateNodeInfo: return clamped
  end
  updateNodeInfo->>updateNodeInfo: store sanitized values
  NodeWatcher->>Metrics: compute inventory metrics
  Metrics->>clampUint64: allocatable total
  clampUint64->>Metrics: return clamped value
  Metrics->>Metrics: return metric set

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

troian

Poem

🐇 I hop through context, steady and spry,
I clip negatives down to a friendly 0, why —
Pods now wear namespaced shoes that fit,
Watchers restart clean; no more misfit! 🌿✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Artem/fix gpu count' is vague and does not clearly summarize the main changes; the PR addresses resource quantity sanitization across multiple resource types (CPU, Memory, GPU, Storage) and pod tracking improvements, not just GPU counting.	Consider a more descriptive title such as 'Sanitize resource quantities and fix pod accounting' that better reflects the comprehensive nature of the defensive handling changes.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is well-related to the changeset, providing detailed context about the fixes including uint underflow prevention, resource sanitization, pod key improvements, watch event handling, and precision preservation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch artem/fix-gpu-count

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

operator/inventory/node-discovery.go (1)

541-546: ⚠️ Potential issue | 🟠 Major

Refresh node quantities on every Added/Modified event.

This guard skips updateNodeInfo() for capacity-only updates, and it also misses sub-core CPU changes because nodeAllocatableChanged() relies on resource.Quantity.Value(). That leaves stale allocatable/capacity data in node, so the new sanitization/clamping path is bypassed on a normal node update. restartPodsWatcher() can stay behind change detection if you want, but updateNodeInfo() itself should run for every watch.Added/watch.Modified.

🛠️ One safe way to split the refresh from the watcher restart

 				switch evt.Type {
 				case watch.Added:
 					fallthrough
 				case watch.Modified:
-					if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
-						updateNodeInfo(ctx, obj, &node)
+					updateNodeInfo(ctx, obj, &node)
+					if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
 						if err = restartPodsWatcher(); err != nil {
 							return err
 						}
 					}
 					signalLabels()

In `k8s.io/apimachinery/pkg/api/resource`, does `Quantity.Value()` preserve milli-CPU precision, or does a quantity like `500m` become `0`? What comparison should be used to detect CPU allocatable changes without truncation?

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@operator/inventory/node-discovery.go` around lines 541 - 546, Always call
updateNodeInfo(...) for every watch.Added or watch.Modified event instead of
skipping it for capacity-only updates; keep nodeAllocatableChanged(...) solely
to decide whether to call restartPodsWatcher(), not to gate the update. Modify
the event handling in the Added/Modified case so updateNodeInfo(ctx, obj, &node)
is invoked unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cluster/kube/operators/clients/inventory/client_test.go`:
- Around line 406-408: The current test only compares
m.TotalAllocatable.Storage["default"] == 0 which false-passes if the "default"
key is missing; update the check in the anonymous check func to first assert the
map contains the "default" key (e.g. require.Contains/ok :=
m.TotalAllocatable.Storage["default"]) and then assert the value equals
uint64(0), referencing the existing m.TotalAllocatable.Storage map and the
"default" key.

In `@operator/inventory/node-discovery.go`:
- Around line 718-724: The current guard skips clamping when sanitized
gpuCapacity is 0, so change the logic in the block using gpuAllocatable and
gpuCapacity (variables gpuAllocatable, gpuCapacity and the
node.Resources.GPU.Quantity.Allocatable/Capacity fields) to clamp whenever
gpuAllocatable > gpuCapacity (remove the gpuCapacity > 0 check), call
node.Resources.GPU.Quantity.Allocatable.Set(gpuCapacity) and emit the same
log.Error with errGPUExceedsCapacity and context ("node", knode.Name,
"allocatable", gpuAllocatable, "capacity", gpuCapacity) so nodes with a
normalized capacity of 0 are correctly clamped.
- Around line 685-717: updateNodeInfo currently only updates fields present in
the incoming Allocatable/Capacity maps, leaving stale values for resources that
disappeared; before the two for-loops, explicitly reset the fields this function
owns: set node.Resources.CPU.Quantity.Allocatable.SetMilli(0) and
.Capacity.SetMilli(0), node.Resources.Memory.Quantity.Allocatable.Set(0) and
.Capacity.Set(0), node.Resources.EphemeralStorage.Quantity.Allocatable.Set(0)
and .Capacity.Set(0) (or .EphemeralStorage if that field name differs), and
node.Resources.GPU.Quantity.Allocatable.Set(0) and .Capacity.Set(0); then run
the existing loops to repopulate only current values so removed extended
resources won’t leave stale non-zero entries.

---

Duplicate comments:
In `@operator/inventory/node-discovery.go`:
- Around line 541-546: Always call updateNodeInfo(...) for every watch.Added or
watch.Modified event instead of skipping it for capacity-only updates; keep
nodeAllocatableChanged(...) solely to decide whether to call
restartPodsWatcher(), not to gate the update. Modify the event handling in the
Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked
unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0116d940-8636-4609-8a76-8b97ce343688

📥 Commits

Reviewing files that changed from the base of the PR and between ba474df and b534eb1.

📒 Files selected for processing (5)

cluster/kube/operators/clients/inventory/client.go
cluster/kube/operators/clients/inventory/client_test.go
cluster/kube/operators/clients/inventory/inventory.go
operator/inventory/node-discovery.go
operator/inventory/node-discovery_test.go

Node modified events must refresh inventory state even when pod watcher restart is not required. The restart predicate now uses quantity comparison so milli-CPU allocatable changes are not missed. Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>

Use Kubernetes namespaced names for pod accounting keys and simplify node watch event cases. Clarify why node status refresh resets only the fields it owns. Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>

vertex451 commented Mar 25, 2026

View reviewed changes

Comment thread operator/inventory/node-discovery.go

vertex451 commented Mar 25, 2026

View reviewed changes

Comment thread operator/inventory/node-discovery.go

vertex451 commented Mar 25, 2026

View reviewed changes

Comment thread operator/inventory/node-discovery.go Outdated

vertex451 commented Mar 25, 2026

View reviewed changes

Comment thread operator/inventory/node-discovery.go Outdated

vertex451 marked this pull request as ready for review March 25, 2026 13:45

vertex451 requested a review from a team as a code owner March 25, 2026 13:45

vertex451 self-assigned this Mar 25, 2026

coderabbitai Bot reviewed Mar 25, 2026

View reviewed changes

Comment thread cluster/kube/operators/clients/inventory/client_test.go

Comment thread operator/inventory/node-discovery.go

Comment thread operator/inventory/node-discovery.go

cloud-j-luna reviewed Apr 17, 2026

View reviewed changes

Comment thread operator/inventory/node-discovery.go Outdated

Comment thread operator/inventory/node-discovery.go Outdated

Comment thread operator/inventory/node-discovery.go

vertex451 added 6 commits June 20, 2026 14:27

fix: gpu count sanitizing

8b6b97e

fix: handle watch.Added event

500b464

chore: simplified code

4001ea0

chore: logger

551b573

chore: reduced boilerplate code in tests

8f947a1

fix: clamp stale resources on node update and tighten GPU capacity guard

7edb230

chalabi2 force-pushed the artem/fix-gpu-count branch from 45b5af0 to 7edb230 Compare June 20, 2026 21:27

chalabi2 added 2 commits June 20, 2026 15:08

style(inventory): address node review comments

09e0101

Use Kubernetes namespaced names for pod accounting keys and simplify node watch event cases. Clarify why node status refresh resets only the fields it owns. Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>

chalabi2 changed the title ~~Artem/fix gpu count~~ Fix inventory resource quantity handling Jun 21, 2026

chalabi2 changed the title ~~Fix inventory resource quantity handling~~ Artem/fix gpu count Jun 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Artem/fix gpu count#379

Artem/fix gpu count#379
vertex451 wants to merge 8 commits into
mainfrom
artem/fix-gpu-count

vertex451 commented Mar 23, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Mar 23, 2026 •

edited

Loading

Reviews paused

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vertex451 commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Resolves:

Details

Changes

Testing

Uh oh!

coderabbitai Bot commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vertex451 commented Mar 23, 2026 •

edited

Loading

coderabbitai Bot commented Mar 23, 2026 •

edited

Loading